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Abstract 

A natural method for the introduction of second-order derivatives of the log likelihood 
into MCMC algorithms is introduced, based on Taylor expansion of the Langevin 
equation followed by exact solution of the truncated system. 


1 Introduction 


Markov chain Monte Carlo (MCMC) is a highly influential computationally intensive 


method for performing Bayesian inference, with a large variety of applications ([Brooks et al. 
2011 ). Whi l e earl ier MCMC algorithms made use of ra ndom walks in para meter space 


( Cilks et al. . 1995 ). as highlighted in a recent review by Green et al. ( 201 51). the use of 
derivatives can lead to improved algorithms. 

One derivative-based approach is the Metropolis-adjusted Langevin Algorithm, MALA 
(Roberts and Tweediel. 1996a! ). which requires first derivatives of the log likelihood to be 
available. More recently , secon d derivatives have been included via the use of geometric 
approaches (jGirolami and Calderheadl. 120111). the MALA-like versions of which are o f- 
ten highly efficient in applications ( Galderhead and Girolamil . l201ll . Kramer et al. . 20141 ). 
Oth er appr oaches include more general position-dependent MALA (PMALA, analysed by 
Xifara et al. ( 2014 )) although the tuning of these in the absence of an appropriate metric 
for geometric approaches remains a problem. 


This letter introduces a different route to inclusion of second-order derivatives through 
truncated Taylor expansion of the log-likelihood, after which the Langevin equation can 
be solved exactly without further approximation. This algorithm is called HMALA (for 
Hessian-corrected MALA) and leads to a four-fold improvement on the effective sample 
size compared to random walk approaches for a simple example, as well as being able to 
deal with non-convex distributions. 
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2 A Hessian MALA algorithm 


2.1 Local solution of the Langevin equation 

From Roberts and Tweediel ( 1996b ). we know that the following Langevin SDE has sta¬ 
tionary distribution it (subject to technical conditions): 


d 6 = -Vln(7r(0))dt + dW . 


(1) 


Now suppose that we approximate l = ln(7r) in the neighbourhood of some value 6' 
through Taylor expansion 
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l(6 n + x) « l(6 n ) + v x + -x Hx , 


where 
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, v := (yi) , H := (H k 


( 2 ) 


(3) 


Then we can approximate the Langevin SDE in the region of 6 n through the linear 
SDE 

(4) 


dx = - (Hx + v) dt + dW . 


From the results of Archambeau et ah ( 200 7). this linear SDE has Gaussian solution with 
mean m and covariance matrix S obeying 


dm 1 . dS TTO , 

"dt~ ~ 2 <Hm + v ) , ^ = HS+1. 


(5) 


Solving these ODEs over the interval [0, <5] with initial conditions m(0) = 0, S(0) = 0, 
gives the solution 


m = e 2 


-1) h-A , 


s = ( e us - 1) H 


( 6 ) 


This solution is best understood in terms of the power-series definition of the matrix 
exponential 
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(7) 


Substituting <[7|) into d6|) gives a series that is clearly the solution to (|5|). subject to the 
initial conditions. 


In terms of numerical computation of ©, various options are available. These include: 
(i) direct computation of matrix exponentials and inverses using e.g. expm() and inv() in 
MATLAB; (ii) solving (JS]) using standard methods for ODEs such as Runge-Kutta; (iii) 
use of numerical methods for matrix functions to calcul ate 0 i, which is quite well studied 
with a recent example being the methods of Niesen and Wright (2012). For the examples 
considered below, (i) performed well, however it is likely that either (ii) or (iii) would be 
preferable for higher dimensional problems. 
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2.2 Metropolis-Hastings scheme 


The proposal density for HMALA is then 

q{0*\6 n ) = J\T( 6 *\m + 0 n ; S) , 


(8) 


leading to acceptance probability 

a= l A ^W) (9) 

Tr(e n )q(0*\e n ) ' l j 

Standard MALA is recovered from HMALA by using d8j) at first order in <5: 

m = -v5 + o(5) , S = <51 + o(<5) . (10) 


For the random-walk (RW) algorithm, we ignore gradient information entirely and use 
proposal density 

q(o*\o n ) =M{e*\e n -si ). (ii) 


It is worth noting in general that t h e solu tion ([6]) has some similarities with the matrix 
cosh form suggested by Betancourt (2013) for a metric in geometric approaches, (e" H + 
e -aH)H( e Q H j m p 0 rtant differences are, however, that: (i) HMALA is not 

mathematically equivalent to any existing geometric approach; (ii) HMALA can be used 
when the Hessian cannot be integrated over all data meaning the Fisher-Rao metric is 
not available; (iii) HMALA does not require tuning an additional parameter a as in the 
matrix cosh approach. 


3 Examples 

3.1 Negative binomial counts 


Consider sampling from a density proportional to the likelihood function for a model of n 
negative binomial distributed integers, represented as a vector k = ( k a ), leading to 


l(k\p, r ) = ( ln ( r (^a + r )) - ln(fc a !) + k a \n(p) - ln(r(r)) + rln(l - p)) , (12) 

a= 1 


U (k \ U 

d P l = ^2\ — - J— ) > drl = (V’l^a +r) - tpi{r) +ln(l - p)) 

^\P 1 ~PJ 


(13) 


# = -E(? + 

o=l VF 


(1 -pf 


d p d r l = 


—n 


1 ~P 


drl = ^2 + r) - fair)) . (14) 


a= 1 


In this example, the Hessian is easily computed, but its expected value over all data 
(needed to calculate the Fisher-Rao metric) involves infinite sums that do not have known 
closed forms. To produce a likelihood function, 100 integers were simulated with ‘true’ 
parameters r = 6 1 = 1.5 and p = 62 = 0.4. Each of the algorithms RW, MALA and 
HMALA defined above was run on this likelihood function. Results o f calc ulating the 
effective sample size as defined by Neal in the discussion of Kass et al. ( 1998 ) are shown 
in Figure [1] 
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Figure [2] shows how the different algorithms behave at the optimal value of ESS. While RW 
is more efficient than MALA for this system, this is primarily because in two dimensions 
ambitious proposals can be efficient, which would not hold for more complex systems. 
MALA offers conservative local proposals into relatively high-density regions, but HMALA 
is able to use higher-order derivative information to make ambitious proposals into high- 
density regions while achieving the largest ESS by a factor of about four. 


3.2 A Gaussian mixture 

Next, consider the following bimodal Gaussian mixture density: 

7 r(0) = i(Ar(0|/, 1 ;S)+^(0| / r 2 ;S)); Ri = Q ; /*2 = ( 4 ) 5 3 ). (15) 

This example exhibits bimodality, as well as a saddle point in a region of high posterior 
density. As can be seen from Figure[3l this does not affect the ability of HMALA to propose 
efficient moves at the saddle point, the modes, or in regions of low posterior density. 
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Figure 1: Mean and 50% Cl for the effective sample size versus step size 5 for the three 
algoritms. In each case 100 chains of length 10 1 were run. 



Figure 2: 200 proposals generated using different methods for the negative binomial 
count model, for step sizes (RW) 5 = 0.6, (MALA) 5 = 0.006, (HMALA) 5 = 0.5, at three 
different parameter values. 
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Figure 3: 300 proposals generated using different methods for the Gaussian mixture 
model, for step sizes (RW) <5 = 2, (MALA) 6 = 2, (HMALA) 6 = 6, at three different 
parameter values. 
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